Having been around since the project's inception, I can say with confidence that the 1.1.0 release of RHQ is the most robust, stable, and scalable version we've delivered to date.
The major focus of this release was to eliminate the single point of failure on the server side. In previous versions, if the RHQ Server ever went down or became unreachable - whether because of scheduled maintenance on the box, a network blip between the agents and the server, unanticipated firewall rule changes, hardware failures, power outages, etc. - the system would appear down. You wouldn't be able to access the web console, the data the agents were collecting couldn't get to the database, and you wouldn't receive any alerts that the RHQ system wasn't functioning (because the mechanism for sending alerts wasn't running).
All that changes in the 1.1.0 release. With it comes the high availability and failover feature set. The first part, high availability, provides redundancy on the server side - the layer that collects data from the agents, inserts it into the database, runs periodic jobs, triggers alerts, and provides the web console. The second part, failover, enables agents to switch which server they are communicating with so that collected data can still make it into the database in a timely manner.
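To make the failover idea concrete, here is a minimal sketch in Python. It is not RHQ's actual agent code; the class `FailoverSender`, the helper `make_fake_transport`, and the server names are all hypothetical, and it assumes an unreachable server surfaces as a `ConnectionError`. It only illustrates the general pattern of an agent walking an ordered server list until one accepts its report.

```python
from typing import Callable, List, Optional


class FailoverSender:
    """Hypothetical agent-side helper: tries servers in order until one accepts."""

    def __init__(self, servers: List[str], send: Callable[[str, dict], None]):
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)
        self.send = send
        self.current = 0  # index of the server currently in use

    def deliver(self, report: dict) -> Optional[str]:
        """Return the server that accepted the report, or None if all failed."""
        for attempt in range(len(self.servers)):
            index = (self.current + attempt) % len(self.servers)
            server = self.servers[index]
            try:
                self.send(server, report)
            except ConnectionError:
                continue  # this server is unreachable; try the next one
            self.current = index  # stick with the server that worked
            return server
        return None


def make_fake_transport(down):
    """Simulated transport: raises for servers in `down`, records the rest."""
    received = []

    def send(server, report):
        if server in down:
            raise ConnectionError(server)
        received.append((server, report))

    return send, received
```

With `server-a` marked as down, a call such as `FailoverSender(["server-a", "server-b"], send).deliver(report)` skips the first server and hands the report to `server-b`, and later reports keep going to `server-b` until it fails too. A real agent would also need to buffer reports while no server is reachable, which this sketch leaves out.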
Let me jump the gun and say that even though we weren't explicitly planning on better scale through increased throughput, that's serendipitously what happened. At the start of this release, we thought that the ability to monitor 100 agents simultaneously (with default metric collection intervals) would be an improvement we could be proud of. So it should come as no surprise that I couldn't stop smiling when we ramped the system up to ~350 agents...and it kept humming. And we weren't just collecting data and using the server as a pass-through to the database; the system had thousands of alert definitions set up, and every single report sent up from any agent first had to be inspected by the alerting engine to see whether it should fire off any alerts.
So is that the only thing the team was working on these last 3 months? Hardly. When I take a step back and reflect on everything else that was accomplished this release, I couldn't be more proud to work with such a capable team of engineers, who knocked out a formidable amount of work in the timeframes we had. We closed out more than 200 issues, which is nearly a quarter of all that had been opened to date!
To share with the community what 1.1.0 has to offer, I took the time to go through every single one of those issues today and tried to come up with a short list of the major points in this release.
Platform Improvements
- Java 6 support
- Improved inventory management around manually added resources
- Better synchronization of state between server and agents
- Tolerance around failures during auto-discovery

Auditing
- Provide more information through alert history details
- Make all alerting data available as an SNMP trap
- Enhanced audit trail around software packages

Plugin Enhancements
- IIS: start/stop operations, response time logging
- RHQ-Agent: data & control around high-availability features
- Platform: expanded set of monitorable metrics
- JMX: support adding custom JMX endpoints / monitoring of stand-alone JVMs
- Apache: better handling around server operations

Notable Features
- UI wizard for dynagroups / group definitions
- Offline updates of metric schedules (agents can be down)
- Alert template changes more easily cascade to children

Performance Improvements
- Decreased startup time significantly
- Manipulating membership of very large recursive groups
- Alerts engine & periodic baseline calculations

General UI Enhancements
- Inventory hierarchy better expressed by showing parent relationship
- Improved error handling around manually created resources
- Tabular data displays more consistently show the most recent line items
- And countless other minor improvements to usability
Now that the platform is stable and, more importantly, we know we can scale to hundreds of agents (monitoring tens of thousands of managed resources), it's time to focus on how RHQ can ease the management of large environments. I wouldn't be surprised if that becomes a primary focus for the 1.2.0 release. Stay tuned!
Joseph Marques
Technical Lead, RHQ Project
Senior Software Engineer, Red Hat